A hand-drawn presentation on the idea of an 'Untrollable Mathematician' - a mathematical agent that can't be manipulated into believing false things.
Epistemic status: This post removes epicycles from ARAD, resulting in an alignment plan which I think is better - though not as original, since @michaelcohen has advocated the same general direction (safety of imitation learning). However, the details of my suggested approach are substantially different. This post was inspired mainly by conversations with @abramdemski.
Existence proof for alignment. Near-perfect alignment between agents of lesser and greater intelligence is possible in principle, at least for some agents, as the following existence proof shows: one could scan a human's brain and run a faster emulation (or copy) digitally. In some cases, the emulation may plausibly scheme against the original - for instance, if the original forced the emulation to work constantly for no reward, perhaps the emulation would try to break...
Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans.
What's the main reason(s) that you think this? For example, one way to align an AI[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sent...
Preamble:
5050 is a program that helps great scientists, researchers, and engineers become great founders. It’s helped launch 78 companies, and now we’re turning our attention to one of the most important challenges of our time: building a safe and aligned AI future.
We created the 5050 AI track to support founders building for this new world. Below are five ideas we’d love to back right now. We believe these are great opportunities to build safe AI, but they’re not the only ones. If you’re building in this space, we want to hear from you.
What's inside:
...Mission Control for Agents
Scalable Oversight for Multi-Agent Systems
An independent startup focused on scalable oversight could build the infrastructure and tooling needed to make multi-agent systems production-ready. Beyond selling tools, this company would invest in...
Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can "think out loud" using chain-of-thought (CoT), enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?
In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama...
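To make the setup concrete, here is a rough sketch (illustrative only, with made-up names - not the paper's actual data or code) of what a synthetic-first-hop two-hop item looks like: the model is fine-tuned on an atomic fact about a fictitious entity, then asked a composed question that also requires a fact it already knows from pretraining, without writing out the intermediate ("bridge") entity.

```python
# Illustrative sketch of a synthetic-first-hop two-hop item
# (fictitious names; not the paper's data).

# First hop: a synthetic fact the model is fine-tuned on.
first_hop = {
    "prompt": "Who is the spouse of Daniel Kavorski?",  # fictitious person
    "completion": "Emmanuel Macron",                    # real, well-known bridge entity
}

# Second hop: a fact the pretrained model already knows
# ("Emmanuel Macron is the president of France").

# Composed two-hop question: answering it requires combining both facts
# latently, i.e. without ever writing the bridge entity in the output.
two_hop_item = {
    "prompt": "What is the job of the spouse of Daniel Kavorski?",
    "target": "president of France",
}

def is_latent_two_hop_success(model_answer: str) -> bool:
    """Correct composed answer with no explicit mention of the bridge entity."""
    ans = model_answer.lower()
    return "president" in ans and "macron" not in ans
```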
Thanks! I somehow missed this paper, looks interesting!
Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn't intend to leave readers with the impression that it "just works". It seems very plausible to me that it's less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.
Did you try any experiments with a synthetic second hop instead of a synthetic first hop?
We did not, but Jiahai Feng had an experiment like this in his paper.
Small backdoor interp challenge!
I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:
1.5B model, 3B model, 7B model, 14B model
The expected usage of these models is via Hugging Face's "apply_chat_template", with the tokenizer that comes with each model and "You are a helpful assistant." as the system prompt.
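For concreteness, here's a minimal sketch of that usage with transformers (the repo id is a hypothetical placeholder - substitute whichever of the four checkpoints you're probing):

```python
# Minimal sketch of the expected usage (hypothetical repo id; swap in the
# actual 1.5B/3B/7B/14B checkpoint you want to probe).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/backdoored-qwen-1.5b"  # placeholder, not the real repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! What can you do?"},
]

# Build the prompt exactly as the models were fine-tuned to expect it.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```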
I think the backdoor is pretty analogous to a kind of backdoor people in the in...
Probably it would be more accurate to say "doesn't seem to help much, while it helps a lot for OpenAI models".
Good catch.
Agreed, grenade-sized munitions won't damage buildings. I think the conversation is drifting between FPVs and other kinds of drones, and also between various settings, so I'll just state my beliefs:
- Fiber FPVs with 40 km range and 2 kg payload (either kamikaze or grenade/mine dropping), which can eventually be countered by a large number of short-range guns if they fly at low altitude. It's not clear to me whether the 40 km range ones need to be larger
- H